The goal of this project is to study the 'COVID.csv' dataset, a database containing information about COVID cases. Using symptom diagnoses and other information about the patients, we'll build a model to predict confirmed COVID cases.
The variables in the dataset represent:
We'll develop the project according to the following steps:
Let's import the libraries we'll use in this step of the project:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("COVID.csv")
df
| | Unnamed: 0 | sex | patient_type | intubed | pneumonia | age | pregnancy | diabetes | copd | asthma | inmsupr | hypertension | other_disease | cardiovascular | obesity | renal_chronic | tobacco | contact_other_covid | covid_res | icu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | NaN | 0.0 | 27 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | NaN |
| 1 | 1 | 0 | 1 | NaN | 0.0 | 24 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 1 | NaN |
| 2 | 2 | 1 | 0 | 0.0 | 0.0 | 54 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | NaN | 1 | 0.0 |
| 3 | 3 | 0 | 0 | 0.0 | 1.0 | 30 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 1 | 0.0 |
| 4 | 4 | 1 | 0 | 0.0 | 0.0 | 60 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | NaN | 1 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 499687 | 499687 | 0 | 1 | NaN | 1.0 | 77 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 | NaN |
| 499688 | 499688 | 0 | 0 | 1.0 | 1.0 | 63 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0 | 0.0 |
| 499689 | 499689 | 1 | 1 | NaN | 0.0 | 25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | NaN |
| 499690 | 499690 | 1 | 1 | NaN | 0.0 | 45 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | NaN |
| 499691 | 499691 | 1 | 1 | NaN | 0.0 | 51 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0 | NaN |
499692 rows × 20 columns
We'll drop the "Unnamed: 0" column because it duplicates the index:
df = df.drop(columns="Unnamed: 0")
df
| | sex | patient_type | intubed | pneumonia | age | pregnancy | diabetes | copd | asthma | inmsupr | hypertension | other_disease | cardiovascular | obesity | renal_chronic | tobacco | contact_other_covid | covid_res | icu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | NaN | 0.0 | 27 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | NaN |
| 1 | 0 | 1 | NaN | 0.0 | 24 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 1 | NaN |
| 2 | 1 | 0 | 0.0 | 0.0 | 54 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | NaN | 1 | 0.0 |
| 3 | 0 | 0 | 0.0 | 1.0 | 30 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 1 | 0.0 |
| 4 | 1 | 0 | 0.0 | 0.0 | 60 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | NaN | 1 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 499687 | 0 | 1 | NaN | 1.0 | 77 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 | NaN |
| 499688 | 0 | 0 | 1.0 | 1.0 | 63 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0 | 0.0 |
| 499689 | 1 | 1 | NaN | 0.0 | 25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | NaN |
| 499690 | 1 | 1 | NaN | 0.0 | 45 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | NaN |
| 499691 | 1 | 1 | NaN | 0.0 | 51 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0 | NaN |
499692 rows × 19 columns
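Incidentally, this kind of leftover index column can also be avoided at load time. A minimal sketch on a hypothetical two-row CSV with the same layout (the miniature data below is made up for illustration):

```python
import io

import pandas as pd

# Hypothetical miniature of the file: an unnamed first column holding a saved index
csv_text = ",sex,age\n0,0,27\n1,1,24\n"

# index_col=0 reads that first column as the index instead of an "Unnamed: 0" data column
df_small = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_small.columns.tolist())  # no "Unnamed: 0" column
```

With `index_col=0`, `pd.read_csv` never creates the "Unnamed: 0" column, so dropping it afterwards becomes unnecessary.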
Let's observe the types of data we have and whether there are missing data:
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 499692 entries, 0 to 499691 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 sex 499692 non-null int64 1 patient_type 499692 non-null int64 2 intubed 107424 non-null float64 3 pneumonia 499681 non-null float64 4 age 499692 non-null int64 5 pregnancy 245258 non-null float64 6 diabetes 498051 non-null float64 7 copd 498246 non-null float64 8 asthma 498250 non-null float64 9 inmsupr 498030 non-null float64 10 hypertension 498203 non-null float64 11 other_disease 497499 non-null float64 12 cardiovascular 498183 non-null float64 13 obesity 498222 non-null float64 14 renal_chronic 498216 non-null float64 15 tobacco 498113 non-null float64 16 contact_other_covid 346017 non-null float64 17 covid_res 499692 non-null int64 18 icu 107423 non-null float64 dtypes: float64(15), int64(4) memory usage: 72.4 MB
Many columns have a large number of missing entries. However, some columns appear to be complete ("sex", "patient_type", "age", "covid_res"). We can use the data in these complete columns to fill in some of the columns with missing data:
When we observe the "pregnancy" column, most of its missing entries coincide with sex 0 (male):
print(f"A total of {(df['sex'] == 0).sum()} lines correspond to male patients:")
display(df[df["sex"]==0]["pregnancy"])
print("Missing data in \"pregnancy\" when \"sex\" is Man:",df[df["sex"]==0]["pregnancy"].isna().sum())
print("Missing data in \"pregnancy\":",df["pregnancy"].isna().sum())
A total of 253098 lines correspond to male patients:
0 NaN
1 NaN
3 NaN
5 NaN
6 NaN
..
499679 NaN
499681 NaN
499683 NaN
499687 NaN
499688 NaN
Name: pregnancy, Length: 253098, dtype: float64
Missing data in "pregnancy" when "sex" is Man: 253098 Missing data in "pregnancy": 254434
Filling these rows does not complete the whole column, but it comes very close.
We'll fill the missing "pregnancy" values with a 0 when "sex" is male:
df["pregnancy"] = np.where(df["sex"] == 0, 0, df["pregnancy"])
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 499692 entries, 0 to 499691 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 sex 499692 non-null int64 1 patient_type 499692 non-null int64 2 intubed 107424 non-null float64 3 pneumonia 499681 non-null float64 4 age 499692 non-null int64 5 pregnancy 498356 non-null float64 6 diabetes 498051 non-null float64 7 copd 498246 non-null float64 8 asthma 498250 non-null float64 9 inmsupr 498030 non-null float64 10 hypertension 498203 non-null float64 11 other_disease 497499 non-null float64 12 cardiovascular 498183 non-null float64 13 obesity 498222 non-null float64 14 renal_chronic 498216 non-null float64 15 tobacco 498113 non-null float64 16 contact_other_covid 346017 non-null float64 17 covid_res 499692 non-null int64 18 icu 107423 non-null float64 dtypes: float64(15), int64(4) memory usage: 72.4 MB
The columns "intubed" and "icu" were not filled for patients who were not hospitalized:
print("Patients who were sent home (1) and patients who were hospitalized (0):")
df["patient_type"].value_counts()
Patients who were sent home (1) and patients who were hospitalized (0):
1 392146 0 107546 Name: patient_type, dtype: int64
print("Missing data in the \"intubed\" column for patients who were sent home:")
df[df["patient_type"]==1]["intubed"].isna().sum()
Missing data in the "intubed" column for patients who were sent home:
392146
print("Missing data in the \"icu\" column for patients who were sent home:")
df[df["patient_type"]==1]["icu"].isna().sum()
Missing data in the "icu" column for patients who were sent home:
392146
Therefore, we'll fill these values with a 0 (the patient was not intubated and was not sent to the ICU):
df["intubed"] = np.where(df["patient_type"] == 1, 0, df["intubed"])
df["icu"] = np.where(df["patient_type"] == 1, 0, df["icu"])
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 499692 entries, 0 to 499691 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 sex 499692 non-null int64 1 patient_type 499692 non-null int64 2 intubed 499570 non-null float64 3 pneumonia 499681 non-null float64 4 age 499692 non-null int64 5 pregnancy 498356 non-null float64 6 diabetes 498051 non-null float64 7 copd 498246 non-null float64 8 asthma 498250 non-null float64 9 inmsupr 498030 non-null float64 10 hypertension 498203 non-null float64 11 other_disease 497499 non-null float64 12 cardiovascular 498183 non-null float64 13 obesity 498222 non-null float64 14 renal_chronic 498216 non-null float64 15 tobacco 498113 non-null float64 16 contact_other_covid 346017 non-null float64 17 covid_res 499692 non-null int64 18 icu 499569 non-null float64 dtypes: float64(15), int64(4) memory usage: 72.4 MB
For the "contact_other_covid" column, there doesn't seem to be a relationship between its missing entries and the data in other columns that would let us fill it the way we filled the previous ones.
Considering the nature of the data in this column (possibly just the patient's own report) and the large amount of missing data, we've decided to drop this column rather than fill it with uncertain values.
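One way to look for such a relationship is to group the column's missingness indicator by another column and compare the rates. A minimal sketch on a toy frame (the column names mirror ours, but the values are made up):

```python
import numpy as np
import pandas as pd

# Toy data: "contact_other_covid" is missing in both patient groups,
# so "patient_type" gives us no simple rule for filling it.
toy = pd.DataFrame({
    "patient_type":        [1, 1, 1, 0, 0, 0],
    "contact_other_covid": [np.nan, 1.0, np.nan, 0.0, np.nan, 1.0],
})

# Fraction of missing "contact_other_covid" within each patient_type
missing_rate = toy["contact_other_covid"].isna().groupby(toy["patient_type"]).mean()
print(missing_rate)
```

If the missing rate were ~100% in one group and ~0% in the other (as it was for "intubed" and "icu"), we would have a rule for filling; roughly equal rates suggest no such rule exists.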
df=df.drop(columns="contact_other_covid")
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 499692 entries, 0 to 499691 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 sex 499692 non-null int64 1 patient_type 499692 non-null int64 2 intubed 499570 non-null float64 3 pneumonia 499681 non-null float64 4 age 499692 non-null int64 5 pregnancy 498356 non-null float64 6 diabetes 498051 non-null float64 7 copd 498246 non-null float64 8 asthma 498250 non-null float64 9 inmsupr 498030 non-null float64 10 hypertension 498203 non-null float64 11 other_disease 497499 non-null float64 12 cardiovascular 498183 non-null float64 13 obesity 498222 non-null float64 14 renal_chronic 498216 non-null float64 15 tobacco 498113 non-null float64 16 covid_res 499692 non-null int64 17 icu 499569 non-null float64 dtypes: float64(14), int64(4) memory usage: 68.6 MB
We have these remaining missing entries:
df.isna().sum()
sex 0 patient_type 0 intubed 122 pneumonia 11 age 0 pregnancy 1336 diabetes 1641 copd 1446 asthma 1442 inmsupr 1662 hypertension 1489 other_disease 2193 cardiovascular 1509 obesity 1470 renal_chronic 1476 tobacco 1579 covid_res 0 icu 123 dtype: int64
df = df.dropna()
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 494948 entries, 0 to 499691 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 sex 494948 non-null int64 1 patient_type 494948 non-null int64 2 intubed 494948 non-null float64 3 pneumonia 494948 non-null float64 4 age 494948 non-null int64 5 pregnancy 494948 non-null float64 6 diabetes 494948 non-null float64 7 copd 494948 non-null float64 8 asthma 494948 non-null float64 9 inmsupr 494948 non-null float64 10 hypertension 494948 non-null float64 11 other_disease 494948 non-null float64 12 cardiovascular 494948 non-null float64 13 obesity 494948 non-null float64 14 renal_chronic 494948 non-null float64 15 tobacco 494948 non-null float64 16 covid_res 494948 non-null int64 17 icu 494948 non-null float64 dtypes: float64(14), int64(4) memory usage: 71.7 MB
df.duplicated().sum()
447456
Although the duplicated rows make up most of our dataset, we'll drop them to avoid creating problems for our model.
df = df.drop_duplicates()
df
| | sex | patient_type | intubed | pneumonia | age | pregnancy | diabetes | copd | asthma | inmsupr | hypertension | other_disease | cardiovascular | obesity | renal_chronic | tobacco | covid_res | icu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0.0 | 0.0 | 27 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 0.0 |
| 1 | 0 | 1 | 0.0 | 0.0 | 24 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 0.0 |
| 2 | 1 | 0 | 0.0 | 0.0 | 54 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1 | 0.0 |
| 3 | 0 | 0 | 0.0 | 1.0 | 30 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 0.0 |
| 4 | 1 | 0 | 0.0 | 0.0 | 60 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 499574 | 0 | 0 | 0.0 | 1.0 | 55 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0.0 |
| 499575 | 1 | 0 | 0.0 | 1.0 | 32 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 |
| 499606 | 0 | 0 | 0.0 | 1.0 | 84 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0 | 0.0 |
| 499613 | 1 | 0 | 1.0 | 0.0 | 23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0.0 |
| 499687 | 0 | 1 | 0.0 | 1.0 | 77 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0.0 |
47492 rows × 18 columns
Since all of our data is numeric and all values are integers, we can avoid floating-point types.
df.describe()
| | sex | patient_type | intubed | pneumonia | age | pregnancy | diabetes | copd | asthma | inmsupr | hypertension | other_disease | cardiovascular | obesity | renal_chronic | tobacco | covid_res | icu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 47492.000000 | 47492.000000 | 47492.000000 | 47492.000000 | 47492.000000 | 47492.000000 | 47492.000000 | 47492.000000 | 47492.000000 | 47492.000000 | 47492.000000 | 47492.000000 | 47492.000000 | 47492.00000 | 47492.000000 | 47492.000000 | 47492.000000 | 47492.000000 |
| mean | 0.478481 | 0.402952 | 0.100670 | 0.442727 | 53.913880 | 0.017582 | 0.376821 | 0.133959 | 0.112545 | 0.120673 | 0.456561 | 0.164744 | 0.162343 | 0.33593 | 0.143603 | 0.214162 | 0.501790 | 0.106586 |
| std | 0.499542 | 0.490496 | 0.300894 | 0.496714 | 20.371761 | 0.131427 | 0.484595 | 0.340612 | 0.316039 | 0.325750 | 0.498115 | 0.370953 | 0.368769 | 0.47232 | 0.350691 | 0.410244 | 0.500002 | 0.308590 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 40.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 55.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 75% | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 69.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.00000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 120.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
Observing the minimum and maximum values in all columns, int8 (which ranges from -128 to 127) can store all of these numbers.
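The int8 bounds can be checked directly with NumPy:

```python
import numpy as np

# int8 holds integers from -128 to 127: enough for our 0/1 flags and ages up to 120
info = np.iinfo(np.int8)
print(info.min, info.max)  # -128 127
```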
df = df.astype("int8")
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 47492 entries, 0 to 499687 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 sex 47492 non-null int8 1 patient_type 47492 non-null int8 2 intubed 47492 non-null int8 3 pneumonia 47492 non-null int8 4 age 47492 non-null int8 5 pregnancy 47492 non-null int8 6 diabetes 47492 non-null int8 7 copd 47492 non-null int8 8 asthma 47492 non-null int8 9 inmsupr 47492 non-null int8 10 hypertension 47492 non-null int8 11 other_disease 47492 non-null int8 12 cardiovascular 47492 non-null int8 13 obesity 47492 non-null int8 14 renal_chronic 47492 non-null int8 15 tobacco 47492 non-null int8 16 covid_res 47492 non-null int8 17 icu 47492 non-null int8 dtypes: int8(18) memory usage: 1.2 MB
We'll start with Descriptive Statistics to analyse the basic features of the data:
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| sex | 47492.0 | 0.478481 | 0.499542 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| patient_type | 47492.0 | 0.402952 | 0.490496 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| intubed | 47492.0 | 0.100670 | 0.300894 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| pneumonia | 47492.0 | 0.442727 | 0.496714 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| age | 47492.0 | 53.913880 | 20.371761 | 0.0 | 40.0 | 55.0 | 69.0 | 120.0 |
| pregnancy | 47492.0 | 0.017582 | 0.131427 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| diabetes | 47492.0 | 0.376821 | 0.484595 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| copd | 47492.0 | 0.133959 | 0.340612 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| asthma | 47492.0 | 0.112545 | 0.316039 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| inmsupr | 47492.0 | 0.120673 | 0.325750 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| hypertension | 47492.0 | 0.456561 | 0.498115 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| other_disease | 47492.0 | 0.164744 | 0.370953 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| cardiovascular | 47492.0 | 0.162343 | 0.368769 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| obesity | 47492.0 | 0.335930 | 0.472320 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| renal_chronic | 47492.0 | 0.143603 | 0.350691 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| tobacco | 47492.0 | 0.214162 | 0.410244 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| covid_res | 47492.0 | 0.501790 | 0.500002 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| icu | 47492.0 | 0.106586 | 0.308590 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Let's check the distribution of the target variable:
df["covid_res"].value_counts()
1 23831 0 23661 Name: covid_res, dtype: int64
plt.figure(figsize=(7,7))
palette ={0: "C0", 1: "C3"}
figura = sns.countplot(x=df["covid_res"], palette=palette)
plt.title('Distribution of the COVID Test Results', fontsize=20)
plt.ylabel("Count", fontsize=15)
plt.yticks(fontsize=12)
xticks=[(p.get_x() + p.get_width() / 2) for p in figura.patches]
plt.xticks(ticks = xticks, labels = ["Negative", "Positive"], fontsize=15)
plt.xlabel("Value", fontsize=15);
plt.show();
The target variable is almost perfectly balanced. We can still try Undersampling, Oversampling and SMOTE to see whether the classification results vary, but they probably won't change much.
Let's group the data by the values of the target variable to see how each feature behaves:
df.groupby("covid_res").mean()
| covid_res | sex | patient_type | intubed | pneumonia | age | pregnancy | diabetes | copd | asthma | inmsupr | hypertension | other_disease | cardiovascular | obesity | renal_chronic | tobacco | icu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.490470 | 0.432526 | 0.073792 | 0.390178 | 52.694814 | 0.019441 | 0.355987 | 0.142936 | 0.118761 | 0.141161 | 0.444022 | 0.182494 | 0.174929 | 0.305989 | 0.153459 | 0.224758 | 0.090613 |
| 1 | 0.466577 | 0.373589 | 0.127355 | 0.494902 | 55.124250 | 0.015736 | 0.397507 | 0.125047 | 0.106374 | 0.100332 | 0.469011 | 0.147119 | 0.149847 | 0.365658 | 0.133817 | 0.203642 | 0.122446 |
The table above shows that the mean of most features is higher for patients whose COVID test came back negative ("covid_res" = 0). Since these features are binary, a higher mean within a class means a larger proportion of ones in that class. Note, however, that the illness-related features ("pneumonia", "intubed", "icu", "diabetes", "hypertension", "obesity") have higher means for the positive class, and the average age of patients who tested negative is lower than that of patients who tested positive.
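To rank which features separate the two classes most, one option is to take the difference of the per-class means. A sketch on a toy frame with the same structure (made-up values, not the original data):

```python
import pandas as pd

# Toy frame mirroring our analysis: binary features plus the binary target
toy = pd.DataFrame({
    "pneumonia": [0, 1, 1, 1, 0, 0],
    "asthma":    [1, 0, 0, 0, 1, 0],
    "covid_res": [1, 1, 1, 0, 0, 0],
})

means = toy.groupby("covid_res").mean()
# Positive-class mean minus negative-class mean, largest gaps first
diff = (means.loc[1] - means.loc[0]).sort_values(ascending=False)
print(diff)
```

Features at the top of this ranking are more common among positives; large absolute differences in either direction hint at predictive value.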
Let's plot a correlation matrix and see if it shows any variables with a high correlation among them:
correlation_matrix=df.corr()
plt.figure(figsize=(20,20))
sns.heatmap(correlation_matrix,cbar=True,fmt=".1f",annot=True,cmap="icefire")
plt.show()
Let's check the distribution of the "age" feature, since it's the only feature with values outside the zero-to-one range.
plt.figure(figsize = (7,7))
figure = sns.histplot(data = df , x = "age", bins = 30)
plt.title("Histogram of the Age of the Patients", fontsize = 20)
plt.ylabel("Count", fontsize = 15)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)
plt.xlabel("Age (years)", fontsize = 15);
plt.show();
There are a lot of people with age zero (probably infants) and some people over 100. Most patients seem to be between 40 and 80 years old. We'll plot a boxplot to check whether we should treat some of the patients as outliers.
import plotly.express as px # We'll plot this one with Plotly (Interactive chart)
fig = px.box(data_frame = df, x = "age")
fig.show()
We'll filter the entries above the upper fence (111 years):
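Where the fence comes from: the standard boxplot rule puts the upper fence at Q3 + 1.5·IQR. A sketch using the "age" quartiles reported by `df.describe()` above (25%: 40, 75%: 69):

```python
# Quartiles of "age" as reported by df.describe() earlier in this notebook
q1, q3 = 40.0, 69.0
iqr = q3 - q1                  # interquartile range
upper_fence = q3 + 1.5 * iqr   # standard boxplot upper-fence rule
print(upper_fence)  # 112.5
```

Plotly draws the upper whisker at the largest observation not exceeding this value, which is presumably why the chart reports 111.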
outliers = df[df["age"]>111].shape[0]
relevance = outliers/ df.shape[0]
print("Total outliers for \"age\":", outliers)
print(f"Rate of outliers: {relevance*100:.2f} %")
Total outliers for "age": 13 Rate of outliers: 0.03 %
Even though this is a very small number of entries, we'll remove them. If we were building a pipeline to process new, unseen data added to our dataset, we would want the final dataset to be free of outliers as well.
df = df[df["age"]<=111]
Importing the libraries used in this step:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from mlxtend.plotting import plot_confusion_matrix
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_curve, roc_auc_score, precision_score, recall_score, f1_score
from lightgbm.sklearn import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier,AdaBoostClassifier,RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier
Separating the data between X (features) and y (target):
X=df.drop(columns="covid_res")
y=df["covid_res"]
As seen before, the target feature is well balanced:
y.value_counts()
1 23827 0 23652 Name: covid_res, dtype: int64
We'll split the dataset into training data and test data and use the training data to pick the best model. We'll use the stratify parameter, even though the target is almost perfectly balanced.
X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size = 0.3,
random_state = 42,stratify=y)
Let's define a function to test the models and return a few metrics:
def test_models_plot_roc_auc_curve(
        model_list,
        X_train,
        X_test,
        y_train,
        y_test):
    """
    model_list: List of the models to be tested. A list of dictionaries.
                Example: [{"model_name": "Logistic Regression", "estimator": LogisticRegression()}]
    X_train: Training data (features)
    X_test: Test data (features)
    y_train: Training data (target)
    y_test: Test data (target)
    """
    plt.figure(figsize=(15, 15))
    response = {}
    for mdl in model_list:
        model = mdl.get("estimator")
        model.fit(X_train, y_train)
        y_predict = model.predict(X_test)
        y_proba = model.predict_proba(X_test)[:, 1]  # scores for the positive class
        fpr, tpr, thresholds = roc_curve(y_test, y_proba)
        model_name = mdl.get("model_name")
        accuracy = accuracy_score(y_test, y_predict)
        auc = roc_auc_score(y_test, y_proba)  # AUC is computed from the scores, not the hard labels
        precision = precision_score(y_test, y_predict, average="weighted")
        recall = recall_score(y_test, y_predict, average="weighted")
        f1 = f1_score(y_test, y_predict, average="weighted")
        plt.plot(fpr, tpr, label="%s ROC (AUC = %0.2f)" % (model_name, auc))
        print(f"Model : {model_name}")
        print(f"Accuracy : {accuracy}")
        print(f"Precision : {precision}")
        print(f"Recall : {recall}")
        print(f"F1 - Score : {f1}")
        print(f"ROC - AUC : {auc}")
        print("======================")
        response[model_name] = {
            "accuracy": accuracy,
            "precision": precision,
            "recall": recall,
            "f1_score": f1,  # store the value, not the f1_score function itself
            "auc": auc,
        }
    plt.plot([0, 1], [0, 1], "r--")
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel("False Positive Rate", fontsize=15)
    plt.ylabel("True Positive Rate", fontsize=15)
    plt.title("ROC-AUC curve", fontsize=15)
    plt.legend(loc="lower right")
    plt.show()
    return response
The list of models we'll test, and the random seed:
random_seed = 42
list_models = [
{"model_name": "Logistic Regression",
"estimator": LogisticRegression(random_state = random_seed)
},
{
"model_name": "Decision Tree",
"estimator": DecisionTreeClassifier(random_state = random_seed)
},
{
"model_name": "Random Forest",
"estimator": RandomForestClassifier(random_state = random_seed)
},
{
"model_name": "AdaBoost",
"estimator": AdaBoostClassifier(random_state = random_seed)
},
{
"model_name": "GradientBoosting",
"estimator": GradientBoostingClassifier(random_state = random_seed)
},
{
"model_name": "XGBoost",
"estimator": XGBClassifier(random_state = random_seed, use_label_encoder=False, eval_metric="logloss" )
},
{
"model_name": "LightGBM",
"estimator": LGBMClassifier(random_state = random_seed)
},
{
"model_name": "CatBoost",
"estimator": CatBoostClassifier(random_state = random_seed, task_type = "GPU", verbose = False)
}
]
Running the function to test the models:
test_models_plot_roc_auc_curve(
list_models,
X_train,
X_test,
y_train,
y_test
);
Model : Logistic Regression Accuracy : 0.5732238135355238 Precision : 0.5732501066375602 Recall : 0.5732238135355238 F1 - Score : 0.5732193268186578 ROC - AUC : 0.5732383732190838 ====================== Model : Decision Tree Accuracy : 0.35313114293737713 Precision : 0.35191852082214625 Recall : 0.35313114293737713 F1 - Score : 0.35166463820664423 ROC - AUC : 0.3533029555811135 ====================== Model : Random Forest Accuracy : 0.36794439764111203 Precision : 0.36793087003200764 Recall : 0.36794439764111203 F1 - Score : 0.36793565908485865 ROC - AUC : 0.36792751608920493 ====================== Model : AdaBoost Accuracy : 0.5776467284470654 Precision : 0.5776554119732427 Recall : 0.5776467284470654 F1 - Score : 0.5775648482341604 ROC - AUC : 0.5775965036853286 ====================== Model : GradientBoosting Accuracy : 0.5750491434990171 Precision : 0.5750411350446818 Recall : 0.5750491434990171 F1 - Score : 0.5750308511017654 ROC - AUC : 0.5750252828110323 ====================== Model : XGBoost Accuracy : 0.5409295141814097 Precision : 0.5410012254600103 Recall : 0.5409295141814097 F1 - Score : 0.5403577869032876 ROC - AUC : 0.5408011417799478 ====================== Model : LightGBM Accuracy : 0.5655714686885707 Precision : 0.5655701250014956 Recall : 0.5655714686885707 F1 - Score : 0.5654939652930497 ROC - AUC : 0.5655231333777899 ====================== Model : CatBoost Accuracy : 0.579331648413367 Precision : 0.5793765631093905 Recall : 0.579331648413367 F1 - Score : 0.5793169935869472 ROC - AUC : 0.5793552599287476 ======================
Let's apply normalization and see which models benefit from it. Even though we have tree-based models, which in theory don't benefit from scaling, some models may improve slightly.
scaler = MinMaxScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)
Testing the models once again:
test_models_plot_roc_auc_curve(
list_models,
X_train_std,
X_test_std,
y_train,
y_test
);
Model : Logistic Regression Accuracy : 0.5731536085369279 Precision : 0.5731812490798265 Recall : 0.5731536085369279 F1 - Score : 0.5731484331274378 ROC - AUC : 0.5731689361788036 ====================== Model : Decision Tree Accuracy : 0.35256950294860995 Precision : 0.35138197443714353 Recall : 0.35256950294860995 F1 - Score : 0.35113628251490914 ROC - AUC : 0.35273925772316533 ====================== Model : Random Forest Accuracy : 0.36850603762987927 Precision : 0.3684925315808287 Recall : 0.36850603762987927 F1 - Score : 0.368497306838643 ROC - AUC : 0.3684891635632266 ====================== Model : AdaBoost Accuracy : 0.5776467284470654 Precision : 0.5776554119732427 Recall : 0.5776467284470654 F1 - Score : 0.5775648482341604 ROC - AUC : 0.5775965036853286 ====================== Model : GradientBoosting Accuracy : 0.5750491434990171 Precision : 0.5750411350446818 Recall : 0.5750491434990171 F1 - Score : 0.5750308511017654 ROC - AUC : 0.5750252828110323 ====================== Model : XGBoost Accuracy : 0.5409295141814097 Precision : 0.5410012254600103 Recall : 0.5409295141814097 F1 - Score : 0.5403577869032876 ROC - AUC : 0.5408011417799478 ====================== Model : LightGBM Accuracy : 0.5655714686885707 Precision : 0.5655701250014956 Recall : 0.5655714686885707 F1 - Score : 0.5654939652930497 ROC - AUC : 0.5655231333777899 ====================== Model : CatBoost Accuracy : 0.5784891884302162 Precision : 0.5785409891476477 Recall : 0.5784891884302162 F1 - Score : 0.5784698171298565 ROC - AUC : 0.5785158642936048 ======================
About the evaluation metrics obtained and which model we'll work with:
Our target variable, "covid_res", tells us whether a patient tested positive for COVID (1) or negative (0);
When our model outputs a False Positive, it labels a patient who actually tested negative as sick. When it outputs a False Negative, it labels a patient who actually tested positive as not sick;
False Negatives are much more dangerous, because we would be telling patients their health is fine when, in fact, they should seek treatment;
Therefore, we want to work with a model which can minimize False Negatives.
This happens when a model has high Recall (Recall decreases as the number of False Negatives grows).
Considering all this, we'll work with the model CatBoost, which obtained the best Recall.
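The link between Recall and False Negatives, made concrete with a small sketch (the counts below are made up for illustration):

```python
# Recall of the positive class is TP / (TP + FN):
# every additional False Negative pushes it down.
def recall(tp, fn):
    return tp / (tp + fn)

print(recall(tp=90, fn=10))  # 0.9
print(recall(tp=90, fn=30))  # 0.75
```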
It's interesting to note that Decision Tree and Random Forest performed very badly, with metrics below 0.5, i.e. worse than simply guessing the labels at random.
Normalization affected only Logistic Regression, Decision Tree, Random Forest and CatBoost, and even for those the changes in the evaluation metrics were marginal. Still, it makes more sense to scale the data.
Even though the target variable is already well balanced, let's see whether the model improves when we use Undersampling, Oversampling and SMOTE.
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE
model = CatBoostClassifier(random_state = random_seed, task_type = "GPU", verbose = False)
model.fit(X_train_std, y_train)
y_pred = model.predict(X_test_std)
print(classification_report(y_test,y_pred))
precision recall f1-score support
0 0.58 0.59 0.58 7096
1 0.58 0.57 0.58 7148
accuracy 0.58 14244
macro avg 0.58 0.58 0.58 14244
weighted avg 0.58 0.58 0.58 14244
cm = confusion_matrix(y_test, y_pred)
plot_confusion_matrix(conf_mat=cm,cmap="coolwarm_r")
plt.show()
undersample = RandomUnderSampler(sampling_strategy='majority',random_state=42)
X_train_un, y_train_un = undersample.fit_resample(X_train_std, y_train)
y_train.value_counts()
1 16679 0 16556 Name: covid_res, dtype: int64
y_train_un.value_counts()
0 16556 1 16556 Name: covid_res, dtype: int64
model = CatBoostClassifier(random_state = random_seed, task_type = "GPU", verbose = False)
model.fit(X_train_un, y_train_un)
y_pred_un = model.predict(X_test_std)
print(classification_report(y_test,y_pred_un))
precision recall f1-score support
0 0.57 0.59 0.58 7096
1 0.58 0.56 0.57 7148
accuracy 0.58 14244
macro avg 0.58 0.58 0.58 14244
weighted avg 0.58 0.58 0.58 14244
cm = confusion_matrix(y_test, y_pred_un)
plot_confusion_matrix(conf_mat=cm,cmap="coolwarm_r")
plt.show()
oversample = RandomOverSampler(sampling_strategy='minority',random_state=42)
X_train_ov, y_train_ov = oversample.fit_resample(X_train_std, y_train)
y_train.value_counts()
1 16679 0 16556 Name: covid_res, dtype: int64
y_train_ov.value_counts()
0 16679 1 16679 Name: covid_res, dtype: int64
model = CatBoostClassifier(random_state = random_seed, task_type = "GPU", verbose = False)
model.fit(X_train_ov, y_train_ov)
y_pred_ov = model.predict(X_test_std)
print(classification_report(y_test,y_pred_ov))
precision recall f1-score support
0 0.58 0.60 0.59 7096
1 0.58 0.56 0.57 7148
accuracy 0.58 14244
macro avg 0.58 0.58 0.58 14244
weighted avg 0.58 0.58 0.58 14244
cm = confusion_matrix(y_test, y_pred_ov)
plot_confusion_matrix(conf_mat=cm,cmap="coolwarm_r")
plt.show()
smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train_std, y_train)
y_train.value_counts()
1    16679
0    16556
Name: covid_res, dtype: int64
y_train_sm.value_counts()
0    16679
1    16679
Name: covid_res, dtype: int64
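Unlike random oversampling, SMOTE does not duplicate rows: it creates new minority samples by interpolating between a minority point and one of its nearest minority neighbours. A self-contained numpy sketch of that interpolation step (toy data; the real SMOTE uses a proper k-NN index and balances against the majority class):

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=3, random_state=42):
    """Generate synthetic minority samples the way SMOTE does:
    pick a minority point, pick one of its k nearest minority
    neighbours, and interpolate at a random fraction between them."""
    rng = np.random.default_rng(random_state)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from point i to every minority point
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                    # interpolation fraction in [0, 1)
        new_points.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new_points)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_min, n_new=5)
print(synthetic.shape)   # each new row lies on a segment between two real points
```

Note that with binary 0/1 features like ours, these interpolated values fall between the categories, which is one reason SMOTE often adds little over random oversampling on categorical data.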
model = CatBoostClassifier(random_state = random_seed, task_type = "GPU", verbose = False)  # fresh model for a fair comparison
model.fit(X_train_sm, y_train_sm)
y_pred_sm = model.predict(X_test_std)
print(classification_report(y_test, y_pred_sm))
precision recall f1-score support
0 0.57 0.59 0.58 7096
1 0.58 0.56 0.57 7148
accuracy 0.58 14244
macro avg 0.58 0.58 0.58 14244
weighted avg 0.58 0.58 0.58 14244
cm = confusion_matrix(y_test, y_pred_sm)
plot_confusion_matrix(conf_mat=cm,cmap="coolwarm_r")
plt.show()
print('Baseline Model')
print(classification_report(y_test, y_pred))
print('\nOversampling')
print(classification_report(y_test, y_pred_ov))
print('\nUndersampling')
print(classification_report(y_test, y_pred_un))
print('\nSMOTE')
print(classification_report(y_test, y_pred_sm))
Baseline Model
precision recall f1-score support
0 0.58 0.59 0.58 7096
1 0.58 0.57 0.58 7148
accuracy 0.58 14244
macro avg 0.58 0.58 0.58 14244
weighted avg 0.58 0.58 0.58 14244
Oversampling
precision recall f1-score support
0 0.58 0.60 0.59 7096
1 0.58 0.56 0.57 7148
accuracy 0.58 14244
macro avg 0.58 0.58 0.58 14244
weighted avg 0.58 0.58 0.58 14244
Undersampling
precision recall f1-score support
0 0.57 0.59 0.58 7096
1 0.58 0.56 0.57 7148
accuracy 0.58 14244
macro avg 0.58 0.58 0.58 14244
weighted avg 0.58 0.58 0.58 14244
SMOTE
precision recall f1-score support
0 0.57 0.59 0.58 7096
1 0.58 0.56 0.57 7148
accuracy 0.58 14244
macro avg 0.58 0.58 0.58 14244
weighted avg 0.58 0.58 0.58 14244
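Since all four variants land at roughly 0.58 accuracy, the tie-breaker is the per-class Recall. A small helper (toy arrays and hypothetical names, for illustration) that extracts it directly from predictions, which is handy when comparing many resampling strategies side by side:

```python
import numpy as np

def recall_per_class(y_true, y_pred):
    """Recall for each class: fraction of that class's true samples
    that the model predicted correctly."""
    return {cls: np.mean(y_pred[y_true == cls] == cls)
            for cls in np.unique(y_true)}

# Toy ground truth and two hypothetical prediction sets
y_true = np.array([0, 0, 0, 1, 1, 1, 1])
preds = {
    "baseline":    np.array([0, 0, 1, 1, 1, 0, 1]),
    "undersample": np.array([0, 1, 1, 1, 1, 1, 1]),
}
for name, y_pred in preds.items():
    print(name, recall_per_class(y_true, y_pred))
```

The same dictionary is what `classification_report(..., output_dict=True)` exposes under each class label's `"recall"` key, if you prefer to reuse sklearn's computation.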
As expected, the evaluation metrics barely changed, since the target classes were already nearly balanced. We'll keep the untreated data, as it yields the best Recall for class 1.
Let's try to improve the CatBoost classifier's results using cross-validation (stratified k-fold) and hyperparameter optimization (Optuna).
from sklearn.model_selection import cross_validate
import optuna
from optuna.integration import OptunaSearchCV
We'll define an objective function that Optuna will try to maximize: the mean cross-validated Recall over the hyperparameter search space below.
def objective_catboost(trial):
    param = {
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.1),
        "n_estimators": trial.suggest_int("n_estimators", 20, 100, step=20),
        "max_depth": trial.suggest_int("max_depth", 1, 5),
        "l2_leaf_reg": trial.suggest_int("l2_leaf_reg", 1, 5),
        "loss_function": trial.suggest_categorical("loss_function", ["Logloss", "CrossEntropy"]),
        "eval_metric": trial.suggest_categorical("eval_metric", ["Recall"]),
        "early_stopping_rounds": trial.suggest_categorical("early_stopping_rounds", [10]),
    }  # Hyperparameter search space
    model = CatBoostClassifier(**param, random_state=42, task_type="GPU", verbose=False)  # Model with the sampled hyperparameters
    # Cross-validation score: the mean of the Recalls (one per fold)
    return cross_validate(model,
                          X_train_std,
                          y_train,
                          scoring=["recall"],
                          cv=5)["test_recall"].mean()
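Conceptually, `study.optimize` just calls the objective repeatedly with suggested hyperparameters and keeps track of the best trial. A dependency-free sketch of that loop, with a hand-rolled random search standing in for Optuna's sampler and a toy analytic score standing in for the cross-validated Recall (both are assumptions for illustration, not Optuna's actual algorithm):

```python
import random

def toy_objective(params):
    # Hypothetical score surface: peaks near learning_rate = 0.07 with shallow trees
    return 0.58 - abs(params["learning_rate"] - 0.07) - 0.01 * (params["max_depth"] - 1)

rng = random.Random(42)
best_value, best_params = float("-inf"), None
for trial in range(50):                             # n_trials analogue
    params = {
        "learning_rate": rng.uniform(0.01, 0.1),    # suggest_float analogue
        "max_depth": rng.randint(1, 5),             # suggest_int analogue
    }
    value = toy_objective(params)
    if value > best_value:                          # direction="maximize"
        best_value, best_params = value, params

print(best_params, round(best_value, 4))
```

Optuna improves on this by learning from past trials (its TPE sampler concentrates new suggestions in promising regions) and by exposing the result through `study.best_params` and `study.best_value`.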
catboost_study = optuna.create_study(direction="maximize", study_name = "CatBoost Classification")
catboost_study.optimize(objective_catboost, n_trials=100)
[I 2022-04-18 22:06:31,937] A new study created in memory with name: CatBoost Classification
[I 2022-04-18 22:06:37,504] Trial 0 finished with value: 0.5379217405685647 and parameters: {'learning_rate': 0.04591540867770837, 'n_estimators': 60, 'max_depth': 5, 'l2_leaf_reg': 5, 'loss_function': 'CrossEntropy', 'eval_metric': 'Recall', 'early_stopping_rounds': 10}. Best is trial 0 with value: 0.5379217405685647.
[... trials 1-42 omitted ...]
[I 2022-04-18 22:10:00,220] Trial 43 finished with value: 0.5822293889026711 and parameters: {'learning_rate': 0.07262743270268447, 'n_estimators': 60, 'max_depth': 1, 'l2_leaf_reg': 5, 'loss_function': 'Logloss', 'eval_metric': 'Recall', 'early_stopping_rounds': 10}. Best is trial 43 with value: 0.5822293889026711.
[... trials 44-98 omitted ...]
[I 2022-04-18 22:14:12,433] Trial 99 finished with value: 0.5677191440251097 and parameters: {'learning_rate': 0.0887512877476902, 'n_estimators': 60, 'max_depth': 1, 'l2_leaf_reg': 2, 'loss_function': 'Logloss', 'eval_metric': 'Recall', 'early_stopping_rounds': 10}. Best is trial 43 with value: 0.5822293889026711.
Let's see what the best results were:
print(f"Number of completed trials: {len(catboost_study.trials)}\n")
print("Best trial:")
trial = catboost_study.best_trial
print("\tBest Score: {}".format(trial.value))
print("\n\tBest Hyperparameters:\n ")
for key, value in trial.params.items():
    print("\t{}: {}".format(key, value))
Number of completed trials: 100

Best trial:
	Best Score: 0.5822293889026711

	Best Hyperparameters:

	learning_rate: 0.07262743270268447
	n_estimators: 60
	max_depth: 1
	l2_leaf_reg: 5
	loss_function: Logloss
	eval_metric: Recall
	early_stopping_rounds: 10
from optuna.visualization import plot_contour
from optuna.visualization import plot_optimization_history
from optuna.visualization import plot_parallel_coordinate
from optuna.visualization import plot_param_importances
from optuna.visualization import plot_slice
fig = plot_optimization_history(catboost_study)
fig.show()
plot_param_importances(catboost_study)
Setting aside eval_metric and early_stopping_rounds (which were held fixed rather than tuned), the hyperparameter with the greatest influence on the score is the learning rate, followed closely by the number of estimators.
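To build intuition for what a parameter-importance plot measures, here is a rough, self-contained sketch (NOT Optuna's actual fANOVA implementation): it estimates how much of the variance in trial scores is explained by grouping trials on a single hyperparameter. The trial records below are hypothetical, not the ones from the study above.

```python
import numpy as np

# Hypothetical trial records (not from the real study)
trials = [
    {"learning_rate": 0.03, "n_estimators": 20, "score": 0.48},
    {"learning_rate": 0.07, "n_estimators": 60, "score": 0.57},
    {"learning_rate": 0.07, "n_estimators": 60, "score": 0.58},
    {"learning_rate": 0.08, "n_estimators": 40, "score": 0.54},
    {"learning_rate": 0.03, "n_estimators": 60, "score": 0.49},
    {"learning_rate": 0.07, "n_estimators": 40, "score": 0.55},
]
scores = np.array([t["score"] for t in trials])

def explained_variance(param):
    """Share of score variance explained by one parameter (between-group / total)."""
    values = np.array([t[param] for t in trials])
    between = 0.0
    for v in np.unique(values):
        mask = values == v
        between += mask.sum() * (scores[mask].mean() - scores.mean()) ** 2
    return between / (len(scores) * scores.var())

print(explained_variance("learning_rate"))
print(explained_variance("n_estimators"))
```

A parameter that sorts trials into groups with very different mean scores explains a large share of the variance, which is the idea the importance plot condenses into a single bar per hyperparameter.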
# We have the hyperparameters nicely stored in this variable
trial.params
{'learning_rate': 0.07262743270268447,
'n_estimators': 60,
'max_depth': 1,
'l2_leaf_reg': 5,
'loss_function': 'Logloss',
'eval_metric': 'Recall',
'early_stopping_rounds': 10}
Now we can use these hyperparameters with our model
model = CatBoostClassifier(**trial.params, random_state = 42, task_type = "GPU", verbose = False)
model.fit(X_train_std, y_train)
y_pred = model.predict(X_test_std)
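The `**trial.params` in the cell above expands the stored dictionary into keyword arguments. A minimal, library-free sketch of the same pattern (the `make_model` function and the parameter values are purely illustrative):

```python
# Dictionary unpacking: each key becomes a keyword argument; keys the
# function does not name explicitly are collected into **other.
def make_model(learning_rate, n_estimators, max_depth, **other):
    return f"lr={learning_rate}, trees={n_estimators}, depth={max_depth}"

best_params = {"learning_rate": 0.0726, "n_estimators": 60, "max_depth": 1,
               "loss_function": "Logloss"}  # extra key lands in **other
print(make_model(**best_params))
```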
Confusion matrix with the final results of our model:
cm = confusion_matrix(y_test, y_pred)
plot_confusion_matrix(conf_mat=cm, cmap="coolwarm_r")
plt.show()
Evaluation metrics for our model:
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1-Score:", f1_score(y_test, y_pred))
# Note: with hard 0/1 predictions, roc_auc_score reflects a single operating
# point; passing predicted probabilities gives the full-curve AUC
print("AUC:", roc_auc_score(y_test, y_pred))
Accuracy: 0.5690817186183657
Precision: 0.5695400716056183
Recall: 0.578623391158366
F1-Score: 0.5740458015267175
AUC: 0.5690467575859474
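To make the definitions of these metrics explicit, here is a small sketch recomputing them from raw confusion-matrix counts. The counts below are hypothetical, not the ones produced by the model above.

```python
# Hypothetical confusion-matrix counts
tn, fp, fn, tp = 400, 300, 290, 410

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted positives, how many were right
recall    = tp / (tp + fn)   # of actual positives, how many were found
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```

F1 is the harmonic mean of precision and recall, which is why it sits between the two in the printed results.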
ROC Curve
y_score = model.predict_proba(X_test_std)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)  # probability-based AUC, matching the plotted curve
plt.figure(figsize=(9, 9))
plt.plot(fpr, tpr, label="%s ROC (AUC = %0.2f)" % ("CatBoost", auc))
plt.plot([0, 1], [0, 1],"r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate",fontsize = 15)
plt.ylabel("True Positive Rate",fontsize = 15)
plt.title("ROC-AUC curve",fontsize = 15)
plt.legend()
plt.show()
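For intuition on what `roc_curve` computes, here is a self-contained sketch that traces the curve by hand: sweep a decision threshold over the predicted scores and record the false/true positive rates at each step. `y_true` and `y_score` are small hypothetical arrays, not the model's output.

```python
import numpy as np

# Hypothetical labels and predicted probabilities
y_true  = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.55, 0.7])

pos, neg = (y_true == 1).sum(), (y_true == 0).sum()
fpr, tpr = [], []
for t in np.sort(np.unique(y_score))[::-1]:   # strictest threshold first
    pred = y_score >= t
    fpr.append((pred & (y_true == 0)).sum() / neg)
    tpr.append((pred & (y_true == 1)).sum() / pos)
print(fpr)
print(tpr)
```

Each threshold yields one (FPR, TPR) point; lowering the threshold can only add predicted positives, so both rates are non-decreasing as the sweep proceeds, which is why the curve runs monotonically from (0, 0) toward (1, 1).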
We tested different models and optimization techniques in order to obtain the best possible result (especially the best Recall) with the data available for the problem at hand. Still, the result is far from satisfactory: the evaluation metrics remain low, and it would not be a good idea to use this model in production. Even so, it was valuable practice with the techniques involved.
Some considerations:
# Quick test without the columns possibly causing data leakage
# Creating a new X_train and X_test
X_train2 = X_train.drop(columns=["intubed","patient_type","icu"])
X_test2 = X_test.drop(columns=["intubed","patient_type","icu"])
# Running the model
model2 = CatBoostClassifier(**trial.params, random_state = 42, task_type = "GPU", verbose = False)
# CatBoost is tree-based, so fitting on the unscaled split is fine
model2.fit(X_train2, y_train)
y_pred2 = model2.predict(X_test2)
# Evaluation metrics
print("Accuracy:", accuracy_score(y_test, y_pred2))
print("Precision:", precision_score(y_test, y_pred2))
print("Recall:", recall_score(y_test, y_pred2))
print("F1-Score:", f1_score(y_test, y_pred2))
print("AUC:", roc_auc_score(y_test, y_pred2))
Accuracy: 0.5635355237292895
Precision: 0.5619098284346322
Recall: 0.5910744264129827
F1-Score: 0.5761232699256834
AUC: 0.5634346201963447
Recall does increase slightly, but at the cost of precision (fewer false negatives, more false positives).
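The same recall/precision trade-off can be demonstrated directly by moving the decision threshold: lowering it recovers false negatives (recall goes up) but admits more false positives (precision goes down). The scores and labels below are hypothetical.

```python
import numpy as np

# Hypothetical labels and predicted probabilities
y_true  = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_score = np.array([0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.55, 0.45])

def precision_recall(threshold):
    pred = y_score >= threshold
    tp = (pred & (y_true == 1)).sum()
    fp = (pred & (y_true == 0)).sum()
    fn = (~pred & (y_true == 1)).sum()
    return tp / (tp + fp), tp / (tp + fn)

print(precision_recall(0.5))    # stricter threshold
print(precision_recall(0.35))   # looser threshold: higher recall, lower precision
```

For a screening problem like this one, where missing a positive case is the costlier error, tuning the threshold on the predicted probabilities (rather than using the 0.5 default implied by `predict`) is a cheap way to trade precision for recall.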